 cost efficiency


SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs

Park, Jinwoo, Cho, Seunggeun, Han, Dongsu

arXiv.org Artificial Intelligence

Large language models (LLMs) power many modern applications, but serving them at scale remains costly and resource-intensive. Current server-centric systems overlook consumer-grade GPUs at the edge. We introduce SpecEdge, an edge-assisted inference framework that splits LLM workloads between edge and server GPUs using a speculative decoding scheme, exchanging only token outputs over the network. SpecEdge employs proactive edge drafting to overlap edge token creation with server verification and pipeline-aware scheduling that interleaves multiple user requests to increase server-side throughput. Experiments show SpecEdge enhances overall cost efficiency by 1.91x by achieving 2.22x server throughput, and reduces inter-token latency by 11.24% compared to a server-only baseline, introducing a scalable, cost-effective paradigm for LLM serving. The code is available at https://github.com/kaist-ina/specedge
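
The verification step at the heart of speculative decoding can be illustrated with a minimal sketch. It assumes greedy decoding, where the verifier accepts the longest draft prefix that matches the target model's own choices and then appends one token of its own. The function name `verify_draft` and the token values are hypothetical, and this is not taken from the SpecEdge codebase.

```python
from typing import List


def verify_draft(draft: List[int], target: List[int]) -> List[int]:
    """Greedy speculative-decoding verification (illustrative only).

    draft  -- k tokens proposed by the small draft model
    target -- the target model's greedy token at each of the k+1 positions,
              obtained in a single forward pass over the drafted sequence
    Returns the tokens that are actually emitted.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must cover one more position than draft")

    emitted: List[int] = []
    for d, t in zip(draft, target):
        if d != t:
            # First mismatch: discard the rest of the draft and emit the
            # target model's token instead.
            emitted.append(t)
            return emitted
        emitted.append(d)

    # Every drafted token matched, so the extra target token comes for free.
    emitted.append(target[-1])
    return emitted


if __name__ == "__main__":
    print(verify_draft([5, 7, 9], [5, 7, 2, 4]))
    print(verify_draft([1, 2], [1, 2, 3]))
```

Only token IDs appear in this exchange, which is consistent with the abstract's point that edge and server need to exchange token outputs only.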


SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving

Li, Xiangchen, Spatharakis, Dimitrios, Ghafouri, Saeid, Fan, Jiakun, Vandierendonck, Hans, John, Deepu, Ji, Bo, Nikolopoulos, Dimitrios

arXiv.org Artificial Intelligence

The growing gap between the increasing complexity of large language models (LLMs) and the limited computational budgets of edge devices poses a key challenge for efficient on-device inference, despite gradual improvements in hardware capabilities. Existing strategies, such as aggressive quantization, pruning, or remote inference, trade accuracy for efficiency or lead to substantial cost burdens. This position paper introduces a new framework that leverages speculative decoding, previously viewed primarily as a decoding acceleration technique for autoregressive generation of LLMs, as a promising approach specifically adapted for edge computing by orchestrating computation across heterogeneous devices. We propose SLED, a framework that allows lightweight edge devices to draft multiple candidate tokens locally using diverse draft models, while a single, shared edge server verifies the tokens using a more precise target model. To further increase the efficiency of verification, the edge server batches the diverse verification requests from devices. This approach supports device heterogeneity and reduces server-side memory footprint by sharing the same upstream target model across multiple devices. Our initial experiments with Jetson Orin Nano, Raspberry Pi 4B/5, and an edge server equipped with 4 Nvidia A100 GPUs indicate substantial benefits: 2.2x more system throughput, 2.8x more system capacity, and better cost efficiency, all without sacrificing model accuracy.


Cost-Efficient LLM Serving in the Cloud: VM Selection with KV Cache Offloading

Kim, Kihyun, Kim, Jinwoo, Chung, Hyunsun, Cha, Myung-Hoon, Kim, Hong-Yeon, Kim, Youngjae

arXiv.org Artificial Intelligence

LLM inference is essential for applications like text summarization, translation, and data analysis, but the high cost of GPU instances from Cloud Service Providers (CSPs) like AWS is a major burden. This paper proposes InferSave, a cost-efficient VM selection framework for cloud-based LLM inference. InferSave optimizes KV cache offloading based on Service Level Objectives (SLOs) and workload characteristics, estimating GPU memory needs and recommending cost-effective VM instances. Additionally, the Compute Time Calibration Function (CTCF) improves instance selection accuracy by adjusting for discrepancies between theoretical and actual GPU performance. Experiments on AWS GPU instances show that selecting lower-cost instances without KV cache offloading improves cost efficiency by up to 73.7% for online workloads, while KV cache offloading saves up to 20.19% for offline workloads.
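
To make "estimating GPU memory needs" concrete, the sketch below uses the standard size formula for a transformer KV cache: one key tensor and one value tensor per layer. The abstract does not describe InferSave's own estimator, so the function name, parameters, and example model dimensions here are illustrative assumptions.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """Size of a transformer KV cache in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem defaults to 2, i.e. 16-bit floats.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


if __name__ == "__main__":
    # Illustrative dimensions: 32 layers, 32 KV heads of size 128,
    # a 4096-token context, and a single sequence.
    size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                          seq_len=4096, batch=1)
    print(f"{size / 2**30:.1f} GiB per sequence")
```

Because the cache grows linearly with both context length and batch size, it can exceed what a cheaper GPU instance holds, which is the situation in which offloading it becomes a relevant option.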


ECCOS: Efficient Capability and Cost Coordinated Scheduling for Multi-LLM Serving

Mei, Kai, Xu, Wujiang, Lin, Shuhang, Zhang, Yongfeng

arXiv.org Artificial Intelligence

As large language models (LLMs) are increasingly deployed as service endpoints in systems, the surge in query volume creates significant scheduling challenges. Existing scheduling frameworks mainly target latency optimization while neglecting the capability of LLMs to serve different levels of queries, which could lead to computational resource waste. This paper addresses this challenge by proposing a capability-cost coordinated scheduling framework, ECCOS, for multi-LLM serving, which explicitly constrains response quality and workload to optimize LLM inference cost. Specifically, it introduces two-stage scheduling by designing a multi-objective predictor and a constrained optimizer. The predictor estimates both model capabilities and computational costs through training-based and retrieval-based approaches, while the optimizer determines cost-optimal assignments under quality and workload constraints. It also introduces QAServe, a dataset collected for sample-wise response quality and costs by zero-shot prompting different LLMs on knowledge QA and mathematical reasoning. Extensive experiments demonstrate that ECCOS improves success rates by 6.30% while reducing costs by 10.15% compared to existing methods, consuming less than 0.5% of LLM response time. The code is available at: https://github.com/agiresearch/ECCOS.
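
The second stage, assigning queries to models under quality and workload constraints, can be sketched as follows. This is a simple greedy heuristic and not the constrained optimizer from the paper: it sends each query to the cheapest model whose predicted quality clears a threshold and which still has capacity. All names, predicted values, and capacities are hypothetical.

```python
from typing import Dict, Optional, Tuple


def assign(queries: Dict[str, Dict[str, Tuple[float, float]]],
           capacity: Dict[str, int],
           min_quality: float) -> Dict[str, Optional[str]]:
    """Greedy query-to-model assignment (illustrative heuristic).

    queries  -- query id -> {model name: (predicted quality, predicted cost)}
    capacity -- model name -> number of queries it may still accept
    Returns query id -> chosen model, or None if no model qualifies.
    """
    remaining = dict(capacity)
    plan: Dict[str, Optional[str]] = {}
    for qid, preds in queries.items():
        # Keep only models that satisfy the quality and workload constraints.
        feasible = [(cost, model) for model, (quality, cost) in preds.items()
                    if quality >= min_quality and remaining.get(model, 0) > 0]
        if not feasible:
            plan[qid] = None
            continue
        _, model = min(feasible)  # cheapest feasible model
        remaining[model] -= 1
        plan[qid] = model
    return plan


if __name__ == "__main__":
    queries = {
        "q1": {"small": (0.90, 1.0), "big": (0.95, 5.0)},
        "q2": {"small": (0.40, 1.0), "big": (0.90, 5.0)},
        "q3": {"small": (0.85, 1.0), "big": (0.90, 5.0)},
    }
    print(assign(queries, {"small": 1, "big": 5}, min_quality=0.8))
```

A greedy pass like this depends on query order and is generally not cost-optimal, which is one reason to formulate the problem as a constrained optimization instead.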


Enhancing Supply Chain Resilience with Metaverse and ChatGPT Technologies

Sarhir, Oumaima

arXiv.org Artificial Intelligence

Global supply lines have been severely disrupted by the COVID-19 epidemic and the conflict between Russia and Ukraine, which have sharply increased commodity prices and fueled inflation. These incidents highlight how critical it is to improve supply chain resilience (SCRES) in order to withstand unforeseen setbacks. SCRES is responsible for controlling both internal and external interruptions, such as transportation problems brought on by natural catastrophes and wars. Enhancing resilience in supply chains requires accurate and timely information transfer. Promising answers to these problems can be found in the Metaverse and ChatGPT, two new digital technologies. The Metaverse may imitate real-world situations and offer dynamic, real-time 3D representations of supply chain data by integrating blockchain, IoT, network connectivity, and computing power. ChatGPT, a large-scale natural language processing model, improves the accuracy and speed of communication and data translation. To manage risk and facilitate decision making in supply chain management, firms should increase the speed and quality of information transmission. This study aims to show the importance of ChatGPT and Metaverse technologies for improving SCRES, with an emphasis on the most important criteria for SCRES and the maturity factor that can directly influence supply chain development.


AI-driven innovation in Medicaid: enhancing access, cost efficiency, and population health management

Ingole, Balaji Shesharao, Ramineni, Vishnu, Krishnappa, Manjunatha Sughaturu, Jayaram, Vivekananda

arXiv.org Artificial Intelligence

Medicaid is a federal-state program that provides healthcare to over 80 million low-income Americans, including pregnant women, children, and individuals with disabilities. Medicaid is one of the biggest healthcare payers in the U.S. and faces a host of problems, including rising healthcare costs, disparities in access, and the management of chronic conditions among at-risk groups. As with Medicare, the use of Artificial Intelligence (AI) offers a major opportunity to change the delivery of care and operational efficiency in Medicaid [1] [16]. While there has been extensive conversation about AI in Medicare, the unique population and requirements of Medicaid require customized AI applications [1]. AI tools can help with chronic disease management, administrative tasks, and cost reduction, especially by focusing on social determinants of health (SDOH) that are important for Medicaid populations. The study will assess the ability of AI-enabled systems to reinforce Medicaid in handling its particular challenges while facilitating fair and quality care for its entire population of beneficiaries [8] [9].


Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity

Griggs, Tyler, Liu, Xiaoxuan, Yu, Jiaxiang, Kim, Doyoung, Chiang, Wei-Lin, Cheung, Alvin, Stoica, Ion

arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly integrated into many online services, yet they remain cost-prohibitive to deploy due to the requirement of expensive GPU instances. Prior work has addressed the high cost of LLM serving by improving the inference engine, but less attention has been given to selecting the most cost-efficient GPU type(s) for a specific LLM service. There is a large and growing landscape of GPU types and, within these options, higher cost does not always lead to increased performance. Instead, through a comprehensive investigation, we find that three key LLM service characteristics (request size, request rate, SLO) strongly influence GPU cost efficiency, and differing GPU types are most cost efficient for differing LLM service settings. As a result, the most cost-efficient allocation for a given service is typically a mix of heterogeneous GPU types. Based on this analysis, we introduce Mélange, a GPU allocation framework that navigates these diverse LLM service characteristics and heterogeneous GPU option space to automatically and efficiently derive the minimal-cost GPU allocation for a given LLM service. We formulate the GPU allocation task as a cost-aware bin packing problem where GPUs are bins and items are slices of the service workload. Our formulation's constraints account for a service's unique characteristics, allowing Mélange to be flexible to support diverse service settings and heterogeneity-aware to adapt the GPU allocation to a specific service. Compared to using only a single GPU type, Mélange reduces deployment costs by up to 77% in conversational settings, 33% in document-based settings, and 51% in a mixed setting.
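
A toy example helps show why a mix of GPU types can be cheaper than any single type. The sketch below is a deliberately simplified stand-in for the paper's bin-packing formulation: it assumes a single kind of workload slice and exhaustively searches over GPU counts for the cheapest combination whose total capacity covers the request rate. The GPU names, prices, and rates are made up for illustration.

```python
from itertools import product
from math import ceil
from typing import Dict, Optional, Tuple


def cheapest_mix(gpus: Dict[str, Tuple[float, float]],
                 demand: float) -> Optional[Tuple[float, Dict[str, int]]]:
    """Exhaustive search for the cheapest GPU mix that covers `demand`.

    gpus   -- GPU name -> (cost per hour, max request rate within the SLO)
    demand -- total request rate the service must sustain
    Returns (hourly cost, {GPU name: count}).
    """
    names = list(gpus)
    # No type ever needs more instances than would serve the demand alone.
    limits = [ceil(demand / gpus[n][1]) for n in names]
    best: Optional[Tuple[float, Dict[str, int]]] = None
    for counts in product(*(range(limit + 1) for limit in limits)):
        capacity = sum(c * gpus[n][1] for c, n in zip(counts, names))
        if capacity < demand:
            continue  # this mix cannot sustain the request rate
        cost = sum(c * gpus[n][0] for c, n in zip(counts, names))
        if best is None or cost < best[0]:
            best = (cost, dict(zip(names, counts)))
    return best


if __name__ == "__main__":
    # Hypothetical prices and rates.
    gpus = {"small": (1.0, 4.0), "large": (3.5, 16.0)}
    print(cheapest_mix(gpus, demand=18.0))
```

With these made-up numbers, one large plus one small GPU costs 4.5 per hour, beating five small GPUs (5.0) and two large ones (7.0). Exhaustive search only scales to tiny instances, so a realistic allocator would need a proper optimization formulation.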


Efficient and Economic Large Language Model Inference with Attention Offloading

Chen, Shaoyuan, Lin, Yutong, Zhang, Mingxing, Wu, Yongwei

arXiv.org Artificial Intelligence

Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. This mismatch arises from the autoregressive nature of LLMs, where the generation phase comprises operators with varying resource demands. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially as context length increases. To enhance the efficiency and cost-effectiveness of LLM serving, we introduce the concept of attention offloading. This approach leverages a collection of cheap, memory-optimized devices for the attention operator while still utilizing high-end accelerators for other parts of the model. This heterogeneous setup ensures that each component is tailored to its specific workload, maximizing overall performance and cost efficiency. Our comprehensive analysis and experiments confirm the viability of splitting the attention computation over multiple devices. Also, the communication bandwidth required between heterogeneous devices proves to be manageable with prevalent networking technologies. To further validate our theory, we develop Lamina, an LLM inference system that incorporates attention offloading. Experimental results indicate that Lamina can provide 1.48x-12.1x higher estimated throughput per dollar than homogeneous solutions.


Improve Cost Efficiency of Active Learning over Noisy Dataset

Chong, Zan-Kai, Ohsaki, Hiroyuki, Ng, Bryan

arXiv.org Artificial Intelligence

Active learning is a learning strategy whereby the machine learning algorithm actively identifies and labels data points to optimize its learning. This strategy is particularly effective in domains where an abundance of unlabeled data exists, but the cost of labeling these data points is prohibitively expensive. In this paper, we consider cases of binary classification, where acquiring a positive instance incurs a significantly higher cost compared to that of negative instances. For example, in the financial industry, such as in money-lending businesses, a defaulted loan constitutes a positive event leading to substantial financial loss. To address this issue, we propose a shifted normal distribution sampling function that samples from a wider range than typical uncertainty sampling. Our simulation underscores that our proposed sampling function limits both noisy and positive label selection, delivering between 20% and 32% improved cost efficiency over different test datasets.


CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation

Li, Minzhi, Shi, Taiwei, Ziems, Caleb, Kan, Min-Yen, Chen, Nancy F., Liu, Zhengyuan, Yang, Diyi

arXiv.org Artificial Intelligence

Annotated data plays a critical role in Natural Language Processing (NLP) in training models and evaluating their performance. Given recent developments in Large Language Models (LLMs), models such as ChatGPT demonstrate zero-shot capability on many text-annotation tasks, comparable with or even exceeding human annotators. Such LLMs can serve as alternatives for manual annotation, due to lower costs and higher scalability. However, limited work has leveraged LLMs as complementary annotators, nor explored how annotation work is best allocated among humans and LLMs to achieve both quality and cost objectives. We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale. Under this framework, we utilize uncertainty to estimate LLMs' annotation capability. Our empirical study shows CoAnnotating to be an effective means to allocate work from results on different datasets, with up to 21% performance improvement over random baseline. For code implementation, see https://github.com/SALT-NLP/CoAnnotating.
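
Uncertainty-guided allocation can be sketched with one common uncertainty measure: the entropy of the labels an LLM returns when the same item is annotated several times. The abstract does not say which measure or threshold CoAnnotating uses, so the function names, the threshold, and the sample labels below are assumptions for illustration.

```python
from collections import Counter
from math import log2
from typing import Dict, List


def label_entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of the empirical label distribution."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def allocate(llm_labels: Dict[str, List[str]], threshold: float) -> Dict[str, str]:
    """Route each item to the LLM or to a human annotator.

    llm_labels -- item id -> labels from repeated LLM annotation runs
    threshold  -- entropy above which the item is sent to a human
    """
    return {item: ("human" if label_entropy(labels) > threshold else "llm")
            for item, labels in llm_labels.items()}


if __name__ == "__main__":
    runs = {
        "x": ["pos", "pos", "pos", "pos"],   # consistent, low uncertainty
        "y": ["pos", "neg", "pos", "neg"],   # split, high uncertainty
    }
    print(allocate(runs, threshold=0.5))
```

Raising the threshold sends more items to the LLM and lowers cost, while lowering it sends more to humans, which is the quality-versus-cost trade-off the abstract describes.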